The R Script associated with this page is available here. Download this file and open it (or copy-paste into a new script) with RStudio so you can follow along.
ctrl-R (or command-R) to run current linecode in script and run ctrl-R (or command-R) to run selection
ctrl-1: script windowctrl-2: console windowTry to run today’s script without using your mouse/trackpad
dplyr and tidyrCheat sheets on website for Data Wrangling
library(dplyr)
library(tidyr)
Remember use install.packages("dplyr") to install a new package.
Data from US Bureau of Transportation Statistics (see ?nycflights13)
library(nycflights13)
Check out the flights object
head(flights)
Check out data structure with glimpse()
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600,…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5,…
## $ arr_time <int> 830, 850, 923, 1004, 812…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12…
## $ carrier <chr> "UA", "UA", "AA", "B6", …
## $ flight <int> 1545, 1714, 1141, 725, 4…
## $ tailnum <chr> "N14228", "N24211", "N61…
## $ origin <chr> "EWR", "LGA", "JFK", "JF…
## $ dest <chr> "IAH", "IAH", "MIA", "BQ…
## $ air_time <dbl> 227, 227, 160, 183, 116,…
## $ distance <dbl> 1400, 1416, 1089, 1576, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, …
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 20…
dplyr “verbs”select() and rename(): Extract existing variablesfilter() and slice(): Extract existing observationsarrange()distinct()mutate() and transmute(): Derive new variablessummarise(): Change the unit of analysissample_n() and sample_frac()-” Select everything but:” Select rangecontains() Select columns whose name contains a character stringends_with() Select columns whose name ends with a stringeverything() Select every columnmatches() Select columns whose name matches a regular expressionnum_range() Select columns named x1, x2, x3, x4, x5one_of() Select columns whose names are in a group of namesstarts_with() Select columns whose name starts with a character stringselect() examplesSelect only the year, month, and day columns:
select(flights,year, month, day)
select() examplesSelect everything except the tailnum:
select(flights,-tailnum)
Select all columns containing the string "time":
select(flights,contains("time"))
You can also rename columns with select()
select(flights,year,carrier,destination=dest)
filter() observationsFilter all flights that departed on on January 1st:
filter(flights, month == 1, day == 1)
This is equivalent to the more verbose code in base R:
flights[flights$month == 1 & flights$day == 1, ]
Compare with dplyr method:
filter(flights, month == 1, day == 1)`
Filter the flights data set to keep only evening flights (dep_time after 1600) in June.
filter(flights,dep_time>1600,month==6)
filter() is similar to subset() except it handles any number of filtering conditions joined together with &.
You can also use other boolean operators, such as OR (“|”):
filter(flights, month == 1 | month == 2)
Filter the flights data set to keep only ‘redeye’ flights where the departure time (dep_time) is “after” the arrival time (arr_time), indicating it arrived the next day:
filter(flights,dep_time>arr_time)
slice():slice(flights, 1:10)
arrange() rowsarrange() is similar to filter() except it reorders instead of filtering.
arrange(flights, year, month, day)
Base R method:
flights[order(flights$year, flights$month, flights$day), ]
desc()arrange(flights, desc(arr_delay))
Base R method:
flights[order(desc(flights$arr_delay)), ]
distinct(
select(flights,carrier)
)
Adds columns with calculations based on other columns.
Average air speed (miles/hour):
select(
mutate(flights,ave_speed=distance/(air_time/60)),
distance, air_time,ave_speed)
Performing multiple operations sequentially with a pipe character
With temporary objects:
a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
a3 <- summarise(a2,
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)
head(a4)
If you don’t want to save the intermediate results: wrap the function calls inside each other:
filter(
summarise(
select(
group_by(flights, year, month, day),
arr_delay, dep_delay
),
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
),
arr > 30 | dep > 30
)
## Adding missing grouping variables: `year`, `month`, `day`
Arguments are distant from function -> difficult to read!
%>% allows you to pipe together various commands
x %>% f(y) turns into f(x, y)
So you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom:
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
group_by()Perform operations by group: mean departure delay by airport (origin)
flights %>%
group_by(origin) %>%
summarise(meanDelay = mean(dep_delay,na.rm=T))
Perform operations by group: mean and sd departure delay by airline (carrier)
flights %>%
group_by(carrier) %>%
summarise(meanDelay = mean(dep_delay,na.rm=T),
sdDelay = sd(dep_delay,na.rm=T))
Flights from which origin airport go the farthest (on average)? Hint: Group by airport (origin) then calculate the maximum flight distance (distance).
flights %>%
group_by(origin) %>%
summarise(meanDist = mean(distance,na.rm=T))
Which destination airport (dest) is the farthest (distance) from NYC?
flights %>%
arrange(desc(distance)) %>%
select(dest,distance) %>%
slice(1)
Which airport is that?
dplyr join methods
left_join(a, b, by = "x1") Join matching rows from b to a.right_join(a, b, by = "x1") Join matching rows from a to b.inner_join(a, b, by = "x1") Retain only rows in both sets.full_join(a, b, by = "x1") Join data. Retain all values, all rows.left_join(a, b, by = "x1") Join matching rows from b to a.

right_join(a, b, by = "x1") Join matching rows from a to b.

inner_join(a, b, by = "x1") Retain only rows in both sets.

full_join(a, b, by = "x1") Join data. Retain all values, all rows.

flights%>%
select(-year,-month,-day,-hour,-minute,-dep_time,-dep_delay)%>%
glimpse()
## Observations: 336,776
## Variables: 12
## $ sched_dep_time <int> 515, 529, 540, 545, 600,…
## $ arr_time <int> 830, 850, 923, 1004, 812…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12…
## $ carrier <chr> "UA", "UA", "AA", "B6", …
## $ flight <int> 1545, 1714, 1141, 725, 4…
## $ tailnum <chr> "N14228", "N24211", "N61…
## $ origin <chr> "EWR", "LGA", "JFK", "JF…
## $ dest <chr> "IAH", "IAH", "MIA", "BQ…
## $ air_time <dbl> 227, 227, 160, 183, 116,…
## $ distance <dbl> 1400, 1416, 1089, 1576, …
## $ time_hour <dttm> 2013-01-01 05:00:00, 20…
Let’s look at the airports data table (?airports for documentation):
glimpse(airports)
## Observations: 1,458
## Variables: 8
## $ faa <chr> "04G", "06A", "06C", "06N", "09J"…
## $ name <chr> "Lansdowne Airport", "Moton Field…
## $ lat <dbl> 41.13047, 32.46057, 41.98934, 41.…
## $ lon <dbl> -80.61958, -85.68003, -88.10124, …
## $ alt <int> 1044, 264, 801, 523, 11, 1593, 73…
## $ tz <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -…
## $ dst <chr> "A", "A", "A", "A", "A", "A", "A"…
## $ tzone <chr> "America/New_York", "America/Chic…
What is the name of the destination airport farthest from the NYC airports? Hints:
flights dataset and airports dataset.select(airports,
dest=faa,
destName=name)%>%
right_join(flights)%>%
arrange(desc(distance)) %>%
slice(1) %>%
select(destName)
## Joining, by = "dest"
flights dataThe section below shows some ‘advanced’ coding to extract the geographic locations for all flights and plotting. This is just meant as an example to illustrate how one might use these functions to perform a mini-analysis that results in a map.
library(geosphere)
library(maps)
library(ggplot2)
library(sp)
library(rgeos)
## rgeos version: 0.4-2, (SVN revision 581)
## GEOS runtime version: 3.6.1-CAPI-1.10.1
## Linking to sp version: 1.3-1
## Polygon checking: TRUE
data=
select(airports,
dest=faa,
destName=name,
destLat=lat,
destLon=lon)%>%
right_join(flights)%>%
group_by(dest,
destLon,
destLat,
distance)%>%
summarise(count=n())%>%
ungroup()%>%
select(destLon,
destLat,
count,
distance)%>%
mutate(id=row_number())%>%
na.omit()
## Joining, by = "dest"
NYCll=airports%>%filter(faa=="JFK")%>%select(lon,lat) # get NYC coordinates
# calculate great circle routes
rts <- gcIntermediate(as.matrix(NYCll),
as.matrix(select(data,destLon,destLat)),
1000,
addStartEnd=TRUE,
sp=TRUE)
rts.ff <- fortify(
as(rts,"SpatialLinesDataFrame")) # convert into something ggplot can plot
## join with count of flights
rts.ff$id=as.integer(rts.ff$id)
gcircles <- left_join(rts.ff,
data,
by="id") # join attributes, we keep them all, just in case
Now build a basemap using data in the maps package.
base = ggplot()
worldmap <- map_data("world",
ylim = c(10, 70),
xlim = c(-160, -80))
wrld <- c(geom_polygon(
aes(long, lat, group = group),
size = 0.1,
colour = "grey",
fill = "grey",
alpha = 1,
data = worldmap
))
Now draw the map using ggplot
base + wrld +
geom_path(
data = gcircles,
aes(
long,
lat,
col = count,
group = group,
order = as.factor(distance)
),
alpha = 0.5,
lineend = "round",
lwd = 1
) +
coord_equal() +
scale_colour_gradientn(colours = c("blue", "orange", "red"),
guide = "colourbar") +
theme(panel.background = element_rect(fill = 'white', colour = 'white')) +
labs(y = "Latitude", x = "Longitude",
title = "Count of Flights from New York in 2013")
## Warning: Ignoring unknown aesthetics: order

This tutorial has been forked from awesome classes developed by Adam Wilson here: http://adamwilson.us/RDataScience/
This exercise based on code from here.